Students:
Airbnb operates an online marketplace for lodging, primarily homestays for vacation rentals and tourism activities. It allows people (hosts) to list their properties for short-term rental and earns money through a commission on each booking. The business model rests on the idea that these rentals are cheaper than hotels, making the company a threat to the hotel industry. The value proposition to hosts is side income; for guests, it is cheaper accommodation.
The market for short-term rentals in cities such as New York is highly competitive because renters are presented with a broad selection of listings matching their specific criteria. Since Airbnb is a marketplace, marketplace dynamics have a huge influence on the amount a host can charge per night. This is in fact one of the biggest challenges for hosts: setting the price for their listing. If they charge above the market rate, they will lose out on revenue, as renters will most likely find a more affordable alternative. If the price is set too low, they again lose out on profits. Additionally, renters may miss out on the opportunity to stay at a great place.
For this project, our goal is to build a regression model that can accurately predict the price of a listing, which will:
1) Help existing hosts adjust their prices
2) Help new hosts decide on a price
Additionally, the machine learning algorithms will provide insight into which factors influence the pricing of these rentals. If those factors are under the hosts' control, hosts can use the insights from this analysis to improve them and provide better accommodation to guests. For the scope of this project, we will only look at properties in New York, given that New York is a highly competitive marketplace for Airbnb. We will also seek to answer the following hypotheses by performing causal ML:
1) Does the 'Starbucks Effect' affect the price of Airbnb listings?
2) Does the distance to the nearest metro station affect the price of Airbnb listings?
Link to the dataset: http://insideairbnb.com/get-the-data.html
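For the second hypothesis, the distance from each listing to its nearest metro station can be derived from the listing's latitude/longitude via the haversine formula. A minimal sketch (the station coordinates below are hypothetical placeholders, not real MTA data):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical station coordinates -- replace with a real station list.
stations = np.array([[40.7527, -73.9772],   # roughly Grand Central area
                     [40.7580, -73.9855]])  # roughly Times Square area

def nearest_station_km(lat, lon):
    """Distance (km) from a listing to the closest station in the list."""
    return min(haversine_km(lat, lon, s[0], s[1]) for s in stations)
```

Applied per listing, this yields a numeric feature that the causal analysis can condition on.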
IMPORT PACKAGES
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
# Common imports
import numpy as np
import os
#Pandas Profiling
#!pip install pandas_profiling
import pandas_profiling
# To plot pretty figures
#!pip install -U seaborn
import seaborn as sns
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Ignore useless warnings (see SciPy issue #5998)
import warnings
import gc
warnings.simplefilter(action='ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
#Display multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
IMPORT DATA
import pandas as pd
df1=pd.read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2021-02-04/data/listings.csv.gz')
df1.head()
df1.shape
DATA DICTIONARY
Since there was no official data dictionary, we used Airbnb's website to interpret some of the features. The dataset contains 74 variables, including many ids, urls, and granular host information (id, url, picture_url, etc.) that would not be used in the analysis, so we decided to drop some of these. We have only provided a dictionary for the variables that we will be using and whose meaning is not obvious from their names.
df1.columns
df1 = df1.drop(columns = ['id', 'listing_url', 'scrape_id',
'picture_url', 'host_id', 'host_url', 'host_name','host_location',
'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
'host_listings_count', 'neighbourhood',
'bathrooms', 'minimum_minimum_nights',
'maximum_minimum_nights', 'minimum_maximum_nights',
'maximum_maximum_nights', 'minimum_nights_avg_ntm',
'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
'calendar_last_scraped', 'number_of_reviews_l30d', 'license',
'calculated_host_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'calculated_host_listings_count_shared_rooms'])
df1.head()
df1.columns
The first few variables relate to the host, including details such as their response rate and profile information. The next few variables describe the property itself. The remaining variables are less intuitive; their descriptions are given below:
accommodates - how many people the property accommodates
minimum_nights - number of minimum nights guests have to stay
maximum_nights - number of maximum nights guests are allowed to stay
number_of_reviews_ltm - Number of reviews in last 12 months
number_of_reviews_l30d - Number of reviews in last 30 days
first_review - Date first review was posted
last_review - Date last review was posted
review_scores_rating - Rating of host for overall experience
review_scores_accuracy - Rating of host for accuracy of listings
review_scores_cleanliness - Rating of host for cleanliness
review_scores_checkin - Rating of host for check-in experience
review_scores_communication - Rating of host for communication
review_scores_location - Rating score for location
review_scores_value - Rating for property's worth (value)
instant_bookable - If property can be instantly booked (i.e. booked straight away, without having to message the host first and wait to be accepted)
reviews_per_month - Reviews per month
CHECK DATA TYPES
df1.dtypes
We observe that the following columns need to be changed to the correct data type.
- first_review, host_since, last_review --> changed to 'date'.
- host_response_rate, host_acceptance_rate, price --> changed to a 'numerical' column
#change variables to 'date' type
df1['host_since']=pd.to_datetime(df1['host_since'])
df1['first_review']=pd.to_datetime(df1['first_review'])
df1['last_review']=pd.to_datetime(df1['last_review'])
df1.dtypes
#change variables to numerical
#get all the non-null values and Convert the object datatype to numerical datatype: host_response_rate, host_acceptance_rate
df1['host_response_rate'] = df1['host_response_rate'].astype(str).str.replace('%', '').astype(float)
df1['host_acceptance_rate'] = df1['host_acceptance_rate'].astype(str).str.replace('%', '').astype(float)
df1['price'] = df1['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)  # regex=False so '$' is treated literally, not as an anchor
df1.dtypes.value_counts().sort_values().plot(kind='barh',
figsize=(20, 6),
fontsize=16,
color="midnightblue")
plt.title('Number of columns by data types', fontsize=18)
plt.xlabel('Number of columns', fontsize=16)
plt.ylabel('Data type', fontsize=16)
SUMMARY STATISTICS
df1.describe()
PANDAS PROFILING
#Generate a HTML report
profile = df1.profile_report(title='Pandas Profiling Report')
#profile
profile.to_file(output_file="profile_report_output.html")
Insights from the profile report:
1. Many host-related variables have missing values; the bathrooms variable is entirely null
2. Room type and property type are highly correlated
3. Availability_30, Availability_60 and Availability_90 are highly correlated
SOME DATA CLEANING
Many variables contain text, so they will need to be processed, either by modifying them or by creating new variables out of them. Most of these columns hold descriptions of the property, the neighbourhood, or the host. Some columns are just dates, so taking differences between those dates would be more useful.
#=======================================================Data Cleansing===================================================
#Only pick the Airbnb apartment with reviews
#df1=df1[df1['number_of_reviews'].astype(int)>0]
#Only pick the Airbnb apartment with price
#df1=df1[df1['price']>0]
#Only pick the Airbnb with the answer(t/f) for "host_is_superhost"
#df1=df1[df1['host_is_superhost'].apply(lambda x: len(str(x))==1)]
#Drop other answer except f/t in "instant_bookable"
#df1=df1[df1['instant_bookable'].isin(['f','t'])]
#To drop the review score which lower than 21 (potential outliers)
#df1=df1[df1['review_scores_rating']>21]
#Replace all the blank cell with NaN value
df1=df1.replace('',np.NaN)
#========================================================Add new features===============================================
#Get the length of the sentence in following five columns (number of words)
df1['name_length'] = df1['name'].apply(lambda x: len(str(x).split()))
df1['description_length']=df1['description'].apply(lambda x: len(str(x).split()))
df1['host_about_length']=df1['host_about'].apply(lambda x: len(str(x).split()))
df1['verifications_length']=df1['host_verifications'].apply(lambda x: len(str(x).split(',')))
df1['amenities_length']=df1['amenities'].apply(lambda x: len(str(x).split(',')))
#Get the difference between the "last_scraped" with the following dates (in days)
df1['host_since_days'] = (pd.to_datetime(df1['last_scraped'])-pd.to_datetime(df1['host_since'])).dt.days
df1['first_review_days'] = (pd.to_datetime(df1['last_scraped'])-pd.to_datetime(df1['first_review'])).dt.days
df1['last_review_days'] = (pd.to_datetime(df1['last_scraped'])-pd.to_datetime(df1['last_review'])).dt.days
#drop 'last_scraped, host_since, first_reviews, last_review'
df1 = df1.drop(columns = ['last_scraped', 'host_since','first_review','last_review'])
#Get the new column to express the price per accommodate
df1['price_per_accommodates']=df1['price']/df1['accommodates'].astype(float)
#=====================================================Change the data type=============================================
#Convert the categorical columns to dummified columns
list_col=['host_is_superhost','host_identity_verified','instant_bookable', 'host_has_profile_pic']
for i in list_col:
    df1[i] = np.where(df1[i] == 't', 1, 0)
CHECK MISSING VALUES
def missing_values(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(columns={
        0: 'Missing Values',
        1: '% of Total Values'
    })
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
            '% of Total Values', ascending=False).round(1)
    print("Dataframe has " + str(df.shape[1]) + " columns.")
    print("There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns
# Missing values statistics
miss_values = missing_values(df1)
miss_values
import missingno as msno
#msno.matrix(df.sample(500), figsize=(12,8))
msno.bar(df1, figsize=(10,6), color='midnightblue')
TARGET VARIABLE
Check distribution of the 'price' variable
#df1['price'].value_counts().plot(kind='bar', color='midnightblue')
plt.figure(figsize=(10,7))
sns.histplot(df1.price, kde=True)  # distplot is deprecated in recent seaborn
The target variable is very heavily right skewed!
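Because the target is so right-skewed, a common remedy is to train the regression on log-transformed prices and invert predictions afterwards; `np.log1p` and `np.expm1` form an exact pair. A minimal sketch on toy prices:

```python
import numpy as np
import pandas as pd

prices = pd.Series([50.0, 75.0, 100.0, 150.0, 300.0, 2500.0])  # toy right-skewed prices
log_prices = np.log1p(prices)     # compressed scale, friendlier for modelling
recovered = np.expm1(log_prices)  # invert model predictions back to dollars
```

The transform compresses the long right tail without changing the ordering of listings.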
CORRELATIONS between predictors and target variable
corr = df1.corr()['price'].sort_values()
corr
No single predictor is excessively correlated with the target variable; all correlation values are below 0.8.
CATEGORICAL VARIABLES
Number of categories in each categorical variable
df1.select_dtypes('object').apply(pd.Series.nunique, axis=0)
df1['host_response_time'].value_counts()
df1['neighbourhood_group_cleansed'].value_counts()
df1['property_type'].value_counts().head(50)
df1['room_type'].value_counts()
To make visualization easier and more insightful, some of these categories are consolidated below.
BATHROOM TEXT
Deriving the number and type of bathroom from the 'bathrooms_text' variable
df1['bathrooms_text']=df1['bathrooms_text'].astype(str)
df1['bathrooms_text']=df1['bathrooms_text'].replace('nan','nan nan')
df1['bathrooms_list'] = df1['bathrooms_text'].apply(lambda x: (x.split(" ", 1)))
df1['bathrooms_list']
new_val=[]
for lst in np.array(df1['bathrooms_list']):
    if len(lst) != 2:
        lst.append(" ")
    new_val.append(lst)
df1['bathrooms_list'] = new_val
list_num=[]
list_name=[]
for i in np.array(df1['bathrooms_list']):
    list_num.append(i[0])
    list_name.append(i[1])
df1['num_bath']=list_num
df1['name_bath']=list_name
df1=df1.drop('bathrooms_list',axis=1)
df1['num_bath'].value_counts().head(20)
#Replace some values to make them numerical
df1['num_bath'] = df1['num_bath'].replace({'Half-bath': 0.5, 'Shared':0.5, 'Private':1})
df1['num_bath'].value_counts().head(20)
df1['name_bath'].value_counts().head(20)
#replace some duplicate categories
df1['name_bath'] = df1['name_bath'].replace({'shared baths': 'shared bath', 'baths':'private bath', 'bath':'private bath'})
df1['name_bath'].value_counts().head(20)
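The loop-based parsing above can also be expressed in one vectorized step; a sketch of the same split using `str.split(expand=True)` on a few example strings:

```python
import pandas as pd

bath = pd.Series(['1 bath', '1.5 shared baths', '2 baths', 'Half-bath'])
parts = bath.str.split(' ', n=1, expand=True)          # col 0: count, col 1: label
num_bath = parts[0].replace({'Half-bath': 0.5}).astype(float)
name_bath = parts[1].fillna(' ')                       # 'Half-bath' has no label part
```

This avoids the intermediate list column and the two explicit loops.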
PROPERTY TYPE
df1.property_type.replace({
'Tiny house': 'House',
'Shared room in townhouse':'Townhouse',
'Private room in earth house' :'Other',
'Shared room in serviced apartment' :'Apartment',
'Private room in bungalow' :'Bunglow',
'Entire cottage' :'Other',
'Houseboat' :'Other',
'Entire villa' :'House',
'Boat' :'Other',
'Entire home/apt' :'Other',
'Private room in casa particular' :'Other',
'Private room in floor' :'Other',
'Shared room in bed and breakfast' :'Other',
'Private room in barn' :'Other',
'Private room in castle' :'Other',
'Private room in cottage' :'Other',
'Barn' :'Other',
'Cave' :'Other',
'Private room in cabin' :'Other',
'Shared room in guest suite':'Other',
'Private room in dome house' :'Other',
'Shared room in guesthouse' :'Other',
'Private room in dorm' :'Other',
'Lighthouse' :'Other',
'Shared room in island' :'Other',
'Room in resort' :'Other',
'Bus' :'Other',
'Shared room in earth house' :'Other',
'Private room in camper/rv' :'Other',
'Shared room in bungalow' :'Other',
'Private room in train' :'Other',
'Private room in farm stay' :'Other',
'Private room in in-law' :'Other',
'Private room in lighthouse' :'Other',
'Private room in tent' :'Other',
'Entire bed and breakfast' :'Other',
'Room in hostel' :'Other',
'Shared room in floor':'Other',
'Private room in bed and breakfast': 'Room in bed and breakfast',
'Entire place':'House',
'Shared room in condominium': 'Condo',
'Private room' :'Private room in house',
'Camper/RV':'Other',
'Private room in villa' : 'Villa',
'Entire bungalow': 'Bunglow',
'Entire floor':'House',
'Entire resort': 'Other',
'Private room in tiny house':'Other'
}, inplace=True)
df1['property_type'].value_counts()
SENTIMENT SCORE OF HOST AND PROPERTY DESCRIPTIONS
#!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    return score['compound']
dfa=df1[['name','description','neighborhood_overview','host_about']]
dfa['name'] = dfa['name'].fillna("Unknown")
dfa['description'] = dfa['description'].fillna("Unknown")
dfa['neighborhood_overview'] = dfa['neighborhood_overview'].fillna("Unknown")
dfa['host_about'] = dfa['host_about'].fillna("Unknown")
dfa['name_sentiment'] = dfa.apply(lambda row : sentiment_analyzer_scores(row['name']), axis = 1)
dfa['description_sentiment'] = dfa.apply(lambda row : sentiment_analyzer_scores(row['description']), axis = 1)
dfa['neighborhood_sentiment'] = dfa.apply(lambda row : sentiment_analyzer_scores(row['neighborhood_overview']), axis = 1)
dfa['hostabout_sentiment'] = dfa.apply(lambda row : sentiment_analyzer_scores(row['host_about']), axis = 1)
dfa['sentiment'] = (dfa['name_sentiment']+dfa['description_sentiment']+dfa['neighborhood_sentiment']+dfa['hostabout_sentiment'])/4
dfa['sentiment']
#Add the sentiment values into the dataframe
df1['total_sentiment'] = np.NaN
df1['name_sentiment'] = np.NaN
df1['description_sentiment'] = np.NaN
df1['neighborhood_sentiment'] = np.NaN
df1['hostabout_sentiment'] = np.NaN
#use .loc on the frame directly to avoid chained-assignment issues
df1.loc[dfa.index, 'total_sentiment'] = dfa['sentiment']
df1.loc[dfa.index, 'name_sentiment'] = dfa['name_sentiment']
df1.loc[dfa.index, 'description_sentiment'] = dfa['description_sentiment']
df1.loc[dfa.index, 'neighborhood_sentiment'] = dfa['neighborhood_sentiment']
df1.loc[dfa.index, 'hostabout_sentiment'] = dfa['hostabout_sentiment']
df1
AMENITIES
Amenities are all present in a list. Although we have calculated the length of list of amenities in a list, it would be useful to see what amenities are usually listed and make those categorical variables.
#creating set of all amenties
amenities = list(df1.amenities)
amenities_list = " ".join(amenities)
amenities_list = amenities_list.replace('[', '')
amenities_list = amenities_list.replace(']', ',')
amenities_list= amenities_list.replace('"', '')
amenities_set = [x.strip() for x in amenities_list.split(',')]
amenities_set = set(amenities_set)
amenities_set
In the list above, some amenities are more important than others (e.g. a parking lot matters more than shampoo). Based on research and personal experience, some of the most important amenities will be selected. Amenities such as wifi and a stove top are standard across all listings, so they were not included in the list.
The amenities chosen are (slashes indicate those categories that can be combined):
Air conditioning/Central air conditioning
BBQ grill
Patio
beachfront/lake access
Breakfast/Complimentary breakfast buffet/ Complimentary continental breakfast/ Complimentary hot breakfast
Cable TV/TV
Coffee maker/ Keurig coffee machine
Cooking basics
Dishwasher/Dryer/Washer
Gym/Private gym/Shared gym/ Shared gym in building/
Free parking on premises/Free street parking/outdoor parking/paid parking off premises/paid parking on premises
Hot tub/Private hot tub/shared hot tub/Shared pool/Shared sauna/private hot tub
Long term stays allowed
Pets allowed
Private entrance
Safe/security system
Microwave
import requests
import nltk
import nltk.corpus
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
df = df1[['amenities']]
df.head()
df = df[df['amenities'].notnull()]
#expand contraction words
import re
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
df['pros1'] = df.apply(lambda row : decontracted(row['amenities']), axis = 1)
#Tokenization the comments column
def token_(x):
    token = word_tokenize(x)
    return token
df['pros_token'] = df.apply(lambda row : token_(row['pros1']), axis = 1)
# Lower Casing the Tokenized comments
def lower_case(x):
    ret = []
    for words in x:
        words = words.lower()
        ret.append(words)
    return ret
df['pros_token'] = df.apply(lambda row : lower_case(row['pros_token']), axis = 1)
# Removing Punctuation
import re
punctutation = re.compile(r'[-.?!,:;()%\/|0-9""]')
def post_punctutation(x):
    ret = []
    for words in x:
        item = punctutation.sub("", words)
        if len(item) > 0:
            ret.append(item)
    return ret
df['pros_token'] = df.apply(lambda row : post_punctutation(row['pros_token']), axis = 1)
#len(df['Comment_token_punct'][0]), len(df['Comment_token'][0])
#Stopwords
stop_words = set(stopwords.words('english'))
def remove_stopwords(x):
    filtered_sentence = []
    for w in x:
        if w not in stop_words:
            filtered_sentence.append(w)
    return filtered_sentence
df['pros_stopwords'] = df.apply(lambda row : remove_stopwords(row['pros_token']), axis = 1)
#len(df['Comment_token_punct_stopwords'][0]),len(df['Comment_token_punct'][0]),
#POS Tagging
nltk.download('averaged_perceptron_tagger')
df['pros_tags'] = df['pros_stopwords'].apply(nltk.tag.pos_tag)
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
df['wordnet_pos'] = df['pros_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
wnl = WordNetLemmatizer()
df['lemmatized'] = df['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
# All duplicate words (including adjectives and verbs) will be removed from the text/comments.
def unique_(test_list):
    res = []
    for i in test_list:
        if i not in res:
            res.append(i)
    return res
df['pros_unique'] = df.apply(lambda row : unique_(row['lemmatized']), axis = 1)
#len(df['Comment_token_punct_stopwords_unique'][0]),len(df['lemmatized'][0]),
#select nouns only
df['nouns'] = df['wordnet_pos'].apply(lambda x: [word for (word, pos) in x if pos[0] == 'n'])
text_list=df['nouns'].tolist()
#print(text_list[0])
#another method
from nltk.probability import FreqDist
fdist = FreqDist()
for i in range(len(df)):
    for word in text_list[i]:
        fdist[word] += 1
word_freqs2 = pd.DataFrame(fdist.items(), columns = ['word', 'frequency']).sort_values(by = ['frequency'], ascending = False)
word_freqs2.head(30)
In addition to the analysis above, we did some research and came up with the top amenities that most guests look for
list_name=['air_conditioning_available','bbq_available','patio','beach','breakfast_available',
'tv_available','coffee_machine_available', 'cooking_basics','dishwasher_available',
'washer and dryer_available','gym','parking','hot_tub_sauna_or_pool','long_term_stays_allowed',
'pets_allowed','private_entrance','secure','microwave_available']
for i in list_name:
    df1[i] = [0] * len(df1)
df1.loc[df1['amenities'].str.contains('Air conditioning|Central air conditioning'), 'air_conditioning_available'] = 1
df1.loc[df1['amenities'].str.contains('BBQ grill'), 'bbq_available'] = 1
df1.loc[df1['amenities'].str.contains('Patio'), 'patio'] = 1
df1.loc[df1['amenities'].str.contains('Beachfront|Lake access'), 'beach'] = 1
df1.loc[df1['amenities'].str.contains('Breakfast|Complimentary breakfast buffet|Complimentary continental breakfast|Complimentary hot breakfast'), 'breakfast_available'] = 1
df1.loc[df1['amenities'].str.contains('TV|Cable TV'), 'tv_available'] = 1
df1.loc[df1['amenities'].str.contains('Coffee maker|Keurig coffee machine'), 'coffee_machine_available'] = 1
df1.loc[df1['amenities'].str.contains('Cooking basics'), 'cooking_basics'] = 1
df1.loc[df1['amenities'].str.contains('Dishwasher'), 'dishwasher_available'] = 1
df1.loc[df1['amenities'].str.contains('Dryer|Washer'), 'washer and dryer_available'] = 1
df1.loc[df1['amenities'].str.contains('Gym|gym|Gym/Private gym|Shared gym|Shared gym in building'), 'gym'] = 1
df1.loc[df1['amenities'].str.contains('Free parking on premises|Free street parking|outdoor parking|paid parking off premises|paid parking on premise'), 'parking'] = 1
df1.loc[df1['amenities'].str.contains('Hot tub|Private hot tub|shared hot tub|Shared pool|Shared sauna|private hot tub'), 'hot_tub_sauna_or_pool'] = 1
df1.loc[df1['amenities'].str.contains('Long term stays allowed'), 'long_term_stays_allowed'] = 1
df1.loc[df1['amenities'].str.contains('Pets allowed', case=False), 'pets_allowed'] = 1  # case-insensitive so 'Pets allowed' also matches
df1.loc[df1['amenities'].str.contains('Private entrance'), 'private_entrance'] = 1
df1.loc[df1['amenities'].str.contains('Safe|Security system'), 'secure'] = 1
df1.loc[df1['amenities'].str.contains('Microwave'), 'microwave_available'] = 1
df1
df1.columns
#Determining which amenities are present in less than 10% of listings
# Replacing nulls with zeros for new columns
nulls_replace = df1.iloc[:,57:].columns
#nulls_replace_list = nulls_replace.to_list()
#df1[nulls_replace] = df1[nulls_replace].fillna(0)
# Produces a list of amenity features where one category (true or false) contains fewer than 10% of listings
fewer_amenities = []
for col in nulls_replace:
    if df1[col].sum() < len(df1)/10:
        fewer_amenities.append(col)
print("Fewer amenities include", fewer_amenities)
# Dropping infrequent amenity features
#test = df1.drop(columns=['bbq_available', 'patio', 'beach', 'breakfast_available', 'hot_tub_sauna_or_pool', 'pets_allowed', 'secure'], axis=1, inplace=True)
In the pre-processing stage, we will need to delete the 'fewer amenities'
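A hedged sketch of that planned step (the helper name `drop_rare_flags` is illustrative, not from the notebook): drop binary amenity flags whose positive rate falls below 10%, shown here on a toy frame.

```python
import pandas as pd

def drop_rare_flags(df, cols, min_frac=0.10):
    """Drop 0/1 flag columns whose share of 1s is below min_frac."""
    rare = [c for c in cols if df[c].mean() < min_frac]
    return df.drop(columns=rare), rare

toy = pd.DataFrame({'patio': [0] * 19 + [1],   # only 5% positive -> rare
                    'tv_available': [1] * 20})
toy2, dropped = drop_rare_flags(toy, ['patio', 'tv_available'])
```

On the real data this would be applied to the `fewer_amenities` columns identified above.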
So far we have added all the necessary features. Now we can drop some features that might not be used in the analysis. We can also look for correlated variables and drop those.
#creating a new dataset so that if we need some columns later we can use df1
df2 = df1.copy(deep = True)
df2
Considering we have already added the day differences, we can get rid of last_scraped, host_since, first_review, last_review. Additionally, for all the columns with text, we have already calculated their length, so we can drop those columns as well, such as name, description, neighbourhood_overview, amenities.
df2 = df2.drop(columns = ['name', 'description', 'host_about','neighborhood_overview','amenities','property_type'])
import seaborn as sns
data_corr = df2.corr()
plt.figure(figsize=(30, 15))
heatmap = sns.heatmap(data_corr, vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12)
corr_matrix = df2.corr()
mat = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))  # np.bool is deprecated; use bool
       .stack()
       .sort_values(ascending=False))
mat
Anything above 75% is considered multicollinear. From the results above, we can see that availability_30, availability_60, and availability_90 are highly correlated with each other. Since New York recently introduced a rule that no rental can be shorter than 30 days, it's better to keep availability_90 and drop the rest. review_scores_rating, review_scores_accuracy, and review_scores_value are correlated as well, so only review_scores_accuracy will be kept.
df2 = df2.drop(columns = ['availability_30', 'availability_60','review_scores_rating','review_scores_value'])
corr_matrix=df2.corr()
corr_matrix["price"].sort_values(ascending=False)
None of the newly added variables correlate highly with the price, so this is good!
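The 75% pairwise threshold catches two-variable redundancy; multicollinearity involving several predictors at once can be screened with the variance inflation factor (VIF). A minimal sketch on synthetic data (not the listings frame), computing VIF with plain NumPy least squares:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(y))])  # regressors + intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()                  # R^2 of y on the others
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
v = vif(np.column_stack([x1, x2, x3]))  # v[0], v[1] large; v[2] near 1
```

A common rule of thumb flags VIF above 5 or 10, the kind of redundancy seen between the availability_* columns.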
Visualization of all predictors
#sns.set(style="ticks")
#sns.pairplot(df2, hue="price", palette="Set1")
#plt.show()
Visualize distribution of numerical variables
num_vars = df2.select_dtypes(include=['int64', 'float64'])  # the second positional argument of select_dtypes is 'exclude', so both types must go in 'include'
num_vars.hist(bins=20, figsize=(20,15), color='midnightblue')
plt.show();
Visualize distribution of Categorical Variables *Please note that this takes a VERY long time to run!
cat_vars = df2.select_dtypes('object')
fig, axes = plt.subplots(round(len(cat_vars.columns) / 4), 4, figsize=(20, 15))
for i, ax in enumerate(fig.axes):
    if i < len(cat_vars.columns):
        cat_vars[cat_vars.columns[i]].value_counts().plot.pie(autopct='%1.1f%%', ax=ax, colormap='tab20b')
        #ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=45)
        ax.set_title(cat_vars.columns[i])
fig.tight_layout();
Does score of review rating have an impact on price?
plt.figure(figsize=(10,7))
plt.scatter(x='review_scores_rating', y="price", data=df1)
plt.title('Price as a function of Review Scores Rating') #title
plt.xlabel('Review Scores Rating') #x label
plt.ylabel('Price') #y label
From the scatter plot above, it doesn't seem like there is a relationship between price and review scores rating. Hence, it would not be useful to predict rating before predicting price. We will nonetheless test this observation during feature analysis.
Geographic Distribution
df1.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.savefig("better_visualization_plot")
df1.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=df1["number_of_reviews"], label="number_of_reviews", figsize=(10,7),
c="review_scores_rating", cmap=plt.get_cmap("jet"), colorbar=True,
sharex=False)
plt.legend()
plt.savefig("Airbnb_review_rating_scatterplot")
Let's see if there are any amenities that particularly have any impact on the price
fig = plt.figure(figsize=(20,10))
graphs=sns.kdeplot(x = 'price', data=df2, hue = 'long_term_stays_allowed')
fig = plt.figure(figsize=(20,10))
graphs=sns.kdeplot(x = 'price', data=df2, hue = 'air_conditioning_available')
fig = plt.figure(figsize=(20,10))
graphs=sns.kdeplot(x = 'price', data=df2, hue = 'gym', gridsize = 2000)
fig = plt.figure(figsize=(20,10))
graphs=sns.kdeplot(x = 'price', data=df2, hue = 'parking')
The distributions of price in the presence and absence of these amenities are similar, implying that there may not be any single amenity that leads to a higher price.
Effect of amenities length on price
plt.figure(figsize=(10,7))
plt.scatter(x='amenities_length', y="price", data=df1)
plt.title('Price as a function of length of amenities') #title
plt.xlabel('length of Amenities') #x label
plt.ylabel('Price') #y label
From the scatter plot above, it is hard to interpret if there is a relationship between length of amenities and price. We will further test this observation during feature engineering.
#Drop variables that are duplicates
df2 = df2.drop(columns = ['host_verifications','neighbourhood_cleansed', 'latitude','longitude', 'bathrooms_text'])
df3 = df2.copy(deep = True)
df3.shape
import numpy as np
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set,valid_set=split_train_test(df3,0.3)
print("The length of train set is: ",len(train_set))
print("The length of valid set is: ",len(valid_set))
valid_set,test_set=split_train_test(valid_set,0.4)
print("The length of valid set is: ",len(valid_set))
print("The length of test set is: ",len(test_set))
train_set.shape, test_set.shape
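The manual splitter above gives a fresh random partition on every run (no fixed seed); scikit-learn's `train_test_split` produces the same 70/18/12 split reproducibly with a fixed `random_state`. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df_toy = pd.DataFrame({'x': np.arange(100), 'y': 2.0 * np.arange(100)})

# 70% train, then split the remaining 30% into 18% valid / 12% test
train, rest = train_test_split(df_toy, test_size=0.3, random_state=42)
valid, test = train_test_split(rest, test_size=0.4, random_state=42)
```

Fixing the seed makes the downstream imputation and modelling results repeatable.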
# Missing values statistics
miss_values = missing_values(train_set)
miss_values
FLAGGING MISSING VALUES
cols = miss_values.index
df_try = train_set[cols].isnull().astype(int).add_suffix('_indicator')
#df_try
#merge the train_set and the flagged columns
train_set = pd.merge(train_set, df_try, left_index=True, right_index=True)
train_set.head(10)
ITERATIVE IMPUTER For numerical variables
train_set.columns
#choose numerical variables only
df_num = train_set.drop(columns=['host_response_time', 'neighbourhood_group_cleansed','name_bath', 'room_type'])
#df_num=df3[['host_response_rate', 'host_acceptance_rate','review_scores_value', 'review_scores_location',
# 'review_scores_checkin','review_scores_accuracy','review_scores_communication',
# 'review_scores_cleanliness','review_scores_rating','reviews_per_month','first_review_days',
# 'last_review_days','bedrooms','beds','host_total_listings_count','host_since_days']]
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(random_state=0)
df_num1 = imp.fit_transform(df_num)
cols = list(df_num)
df_num1=pd.DataFrame(df_num1)
df_num1.columns=cols
# Re-check Missing values statistics
miss_values = missing_values(df_num1)
miss_values.head(20)
Now, replace the incomplete columns in train_set with the corresponding imputed columns from df_num1
train_set[cols] = df_num1[cols].values
train_set.head(5)
# Re-check Missing values statistics
miss_values = missing_values(train_set)
miss_values.head(20)
We have taken care of the numerical values, and need to work on the categorical values next...
train_set['host_response_time'].mode()
train_set['host_response_time'] = train_set['host_response_time'].fillna("within an hour")
# Check Missing values statistics again
miss_values = missing_values(train_set)
miss_values.head(20)
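As a side note, the fill value above was read off manually from `.mode()`; it can equally be derived programmatically, so the code stays correct if the modal category ever changes. A sketch on a toy series:

```python
import pandas as pd

s = pd.Series(['within an hour', 'within an hour', 'within a day', None])
s_filled = s.fillna(s.mode()[0])   # mode() ignores missing values by default
```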
train_set = pd.get_dummies(train_set, columns=['host_response_time','neighbourhood_group_cleansed','name_bath', 'room_type'])
train_set.head(5)
Correlations
corr_matrix = train_set.corr()
mat = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))  # np.bool is deprecated; use bool
       .stack()
       .sort_values(ascending=False))
mat
data_corr2 = train_set.corr()
plt.figure(figsize=(30, 15))
heatmap1 = sns.heatmap(data_corr2, vmin=-1, vmax=1, annot=True)
heatmap1.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12)
Dropping highly correlated variables
train_set = train_set.drop(columns = ['host_response_rate_indicator','host_since_days_indicator','reviews_per_month_indicator','first_review_days_indicator','review_scores_cleanliness_indicator','review_scores_accuracy_indicator','review_scores_checkin_indicator','review_scores_communication_indicator','room_type_Private room','maximum_nights','name_bath_shared bath', 'host_acceptance_rate']) #duplicate entries removed from the drop list
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(n_estimators=100, random_state=42, contamination=0.02)
pred = iforest.fit_predict(train_set)
score = iforest.decision_function(train_set)
from numpy import where
anom_index = where(pred== -1)
values = train_set.iloc[anom_index]
values
NOTE: Out of 22208 observations, 519 (about 2.3%) are flagged as outliers, consistent with contamination=0.02. We will remove them.
train_set = train_set[~train_set.index.isin(values.index)]
train_set.shape
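As a sanity check on the outlier count: `contamination=0.02` sets the decision threshold so that roughly 2% of rows are labeled -1, which lines up with 519 of 22208 (~2.3%). A toy sketch on synthetic data (not the project data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_toy = rng.normal(size=(1000, 5))  # synthetic stand-in for train_set
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = iso.fit_predict(X_toy)     # -1 marks a flagged outlier
n_outliers = int((labels == -1).sum())
print(n_outliers)                   # close to 0.02 * 1000 = 20
```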
#train_set.columns
#Flag Missing Values
miss_values = missing_values(valid_set)
cols = miss_values.index
df_try = valid_set[cols].isnull().astype(int).add_suffix('_indicator')
#merge both the df1 and the flagged columns
valid_set = pd.merge(valid_set, df_try, left_index=True, right_index=True)
#Iterative Imputer
df_num = valid_set.drop(columns=['host_response_time', 'neighbourhood_group_cleansed','name_bath', 'room_type'])
df_num1 = imp.transform(df_num) #reuse the imputer fitted on the training set to avoid leakage (assumes the same columns as in training)
cols = list(df_num)
df_num1=pd.DataFrame(df_num1)
df_num1.columns=cols
valid_set[cols] = df_num1[cols].values
#categorical encoding
valid_set['host_response_time'] = valid_set['host_response_time'].fillna("within an hour")
valid_set = pd.get_dummies(valid_set, columns=['host_response_time','neighbourhood_group_cleansed','name_bath', 'room_type'])
#drop correlated variables
valid_set = valid_set.drop(columns = ['host_response_rate_indicator','host_since_days_indicator','reviews_per_month_indicator','first_review_days_indicator','review_scores_cleanliness_indicator','review_scores_accuracy_indicator','review_scores_checkin_indicator','review_scores_communication_indicator','room_type_Private room','maximum_nights','name_bath_shared bath', 'host_acceptance_rate']) #duplicate entries removed from the drop list
#Flag Missing Values
miss_values = missing_values(test_set)
cols = miss_values.index
df_try = test_set[cols].isnull().astype(int).add_suffix('_indicator')
#merge both the df1 and the flagged columns
test_set = pd.merge(test_set, df_try, left_index=True, right_index=True)
#Iterative Imputer
df_num = test_set.drop(columns=['host_response_time', 'neighbourhood_group_cleansed','name_bath', 'room_type'])
df_num1 = imp.transform(df_num) #reuse the imputer fitted on the training set to avoid leakage (assumes the same columns as in training)
cols = list(df_num)
df_num1=pd.DataFrame(df_num1)
df_num1.columns=cols
test_set[cols] = df_num1[cols].values
#categorical encoding
test_set['host_response_time'] = test_set['host_response_time'].fillna("within an hour")
test_set = pd.get_dummies(test_set, columns=['host_response_time','neighbourhood_group_cleansed','name_bath', 'room_type'])
#drop correlated variables
test_set = test_set.drop(columns = ['host_response_rate_indicator','host_since_days_indicator','reviews_per_month_indicator','first_review_days_indicator','review_scores_cleanliness_indicator','review_scores_accuracy_indicator','review_scores_checkin_indicator','review_scores_communication_indicator','room_type_Private room','maximum_nights','name_bath_shared bath', 'host_acceptance_rate']) #duplicate entries removed from the drop list
#standardize the data
#sc = StandardScaler()
#X_test_std = sc.transform(X_test)
#drop most useless variables from feature selection
#for i in to_drop:
# X_test_std= X_test_std.drop(columns = [i])
#X_test.columns
Separate Predictors and Target Variable
y_train = train_set['price']
X_train = train_set.drop(columns=['price', 'price_per_accommodates_indicator','price_per_accommodates']) ## drop anything price-related to avoid data leakage
y_valid = valid_set['price']
X_valid = valid_set.drop(columns=['price', 'price_per_accommodates_indicator','price_per_accommodates']) ## drop anything price-related to avoid data leakage
y_test = test_set['price']
X_test = test_set.drop(columns=['price','price_per_accommodates_indicator','price_per_accommodates']) ## drop anything price-related to avoid data leakage
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_valid_std = sc.transform(X_valid)
X_test_std = sc.transform(X_test)
X_valid_std = pd.DataFrame(X_valid_std,columns = X_valid.columns)
X_train_std = pd.DataFrame(X_train_std,columns = X_train.columns)
X_test_std = pd.DataFrame(X_test_std,columns = X_test.columns)
RandomForest Method
#RandomForest Method
from sklearn.ensemble import RandomForestRegressor
randomforest = RandomForestRegressor(random_state=0) #price is continuous, so feature importances should come from a regressor, not a classifier
model = randomforest.fit(X_train_std,y_train)
model.feature_importances_
pd.DataFrame(list(zip(X_train.columns,model.feature_importances_)), columns = ['predictor','feature importance']).sort_values("feature importance")
model_features = pd.DataFrame(list(zip(X_train_std.columns,model.feature_importances_)), columns = ['predictor','feature importance']).sort_values("feature importance")
model_features.tail(20)
Recursive Feature Elimination Method
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rfe = RFE(rf, n_features_to_select=50)
model_l = rfe.fit(X_train_std, y_train)
model_l_df = pd.DataFrame(list(zip(X_train_std.columns,model_l.ranking_)), columns = ['predictor','ranking'])
model_l_df
notgood = model_l_df[model_l_df['ranking'] !=1 ]
notgood
to_drop = notgood['predictor'].to_list()
#Removing the features that are useless to our model
X_train_std = X_train_std.drop(columns=to_drop)
X_valid_std = X_valid_std.drop(columns=to_drop)
X_test_std = X_test_std.drop(columns=to_drop)
X_valid_std.shape, y_valid.shape, X_train_std.shape, y_train.shape,X_test_std.shape
X_train_std.to_csv('X_train.csv', index = False)
X_test_std.to_csv('X_test.csv', index = False)
X_valid_std.to_csv('X_valid.csv', index = False)
y_train.to_csv('y_train.csv', index = False)
y_test.to_csv('y_test.csv', index = False)
y_valid.to_csv('y_valid.csv', index = False)
from sklearn.decomposition import PCA
scaler = StandardScaler()
scaler.fit(train_set)
scaled_data = scaler.transform(train_set)
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=train_set['price'],cmap='tab20b')
plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')
pca_map = pd.DataFrame(pca.components_,columns=train_set.columns) #renamed from `map` to avoid shadowing the builtin
plt.figure(figsize=(12,6))
sns.heatmap(pca_map,cmap='twilight')
from tensorflow import keras
encoder = keras.models.Sequential([
keras.layers.Dense(3, input_shape=[50]),
])
decoder = keras.models.Sequential([
keras.layers.Dense(50, input_shape=[3]),
])
autoencoder = keras.models.Sequential([encoder, decoder])
autoencoder.compile(loss='mse', optimizer = keras.optimizers.SGD(learning_rate=0.01)) #'lr' is deprecated in favor of 'learning_rate'
history = autoencoder.fit(X_train_std,X_train_std, epochs=20,validation_data=(X_valid_std,X_valid_std),
callbacks=[keras.callbacks.EarlyStopping(patience=5)])
codings = encoder.predict(X_test_std)
#X_test_std
#X_train_std
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Flatten,Reshape
from tensorflow.keras.optimizers import SGD
##Encoder
encoder = Sequential()
encoder.add(Flatten(input_shape=[50]))
encoder.add(Dense(400,activation="relu"))
encoder.add(Dense(200,activation="relu"))
encoder.add(Dense(100,activation="relu"))
encoder.add(Dense(50,activation="relu"))
encoder.add(Dense(2,activation="relu"))
### Decoder
decoder = Sequential()
decoder.add(Dense(50,input_shape=[2],activation='relu'))
decoder.add(Dense(100,activation='relu'))
decoder.add(Dense(200,activation='relu'))
decoder.add(Dense(400,activation='relu'))
decoder.add(Dense(50, activation="relu"))
decoder.add(Reshape([50]))
### Autoencoder
autoencoder = Sequential([encoder,decoder])
autoencoder.compile(loss="mse", optimizer = keras.optimizers.SGD(learning_rate=0.1)) #'lr' is deprecated in favor of 'learning_rate'
autoencoder.fit(X_train_std,X_train_std,epochs=10, callbacks=[keras.callbacks.EarlyStopping(patience=5)])
encoded_2dim = encoder.predict(X_valid_std)
# The 2D
AE = pd.DataFrame(encoded_2dim, columns = ['X1', 'X2'])
AE['target'] = y_valid
#sns.lmplot(x='X1', y='X2', data=AE, hue='target', fit_reg=False, size=10)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,8))
sns.lmplot(x='X1', y='X2', data=AE, hue='target', fit_reg=False, height=5) #'size' was renamed to 'height' in seaborn 0.9
plt.gca().set_xlim(0, 20)
plt.figure(figsize=(10,8))
sns.lmplot(x='X1', y='X2', data=AE, hue='target', fit_reg=True, height=5) #'size' was renamed to 'height' in seaborn 0.9
plt.gca().set_ylim(0, 20)
plt.gca().set_xlim(0, 10)
TSNE
#from sklearn.manifold import TSNE
#ts = TSNE()
#X_tsne = ts.fit_transform(X_train_std)
#fig, ax = plt.subplots(figsize=(6, 4))
#colors = ["rg"[j] for j in y_train['Price']]
#scat = ax.scatter(
# X_tsne[:, 0],
# X_tsne[:, 1],
# c=colors,
# alpha=0.5,
#)
#ax.set_xlabel("Embedding 1")
#ax.set_ylabel("Embedding 2")
Unsupervised Training Models are very difficult to interpret
AutoML
!pip install h2o
import h2o
from h2o.automl import H2OAutoML
# initialize the h2o session
h2o.init()
# load an h2o DataFrame from pandas DataFrame.
train_set.to_csv('automl_train.csv')
train_set.info()
train_set.describe().columns
df_train_h2o = h2o.import_file('automl_train.csv') #renamed from df_test: this frame holds the training data
x = list(train_set.describe().columns)
x.remove('price')
x.remove('price_per_accommodates')
x.remove('price_per_accommodates_indicator')# remove the target and price-derived columns
aml = H2OAutoML(max_models=10, seed=1)
df_train_h2o
pred=aml.train(x=x, y='price', training_frame=df_train_h2o)
lb = aml.leaderboard # Leader board
print(lb.head(rows=lb.nrows)) # print leader board
valid_set.to_csv('for_automl_test.csv')
test = h2o.import_file('for_automl_test.csv')
preds = aml.predict(test)
preds
list_model=h2o.as_list(lb.head(rows=lb.nrows)['model_id'], use_pandas=False)
import itertools
list_m = list(itertools.chain(*list_model))
list_m
#Model Explainability
aml.explain(test)
INSIGHTS: According to AutoML, a Stacked Ensemble model works best, followed by GBM.
Now that we have an idea of which model will perform best, we can go ahead with building that model.
Additionally, we can see that many of the important features concern the host, such as the host's response rate, whether the host responds within an hour, and the length of the host's self-description. Other key features include the review score for location, the number of days since the last review, and the number of people the property accommodates.
RANDOM FOREST
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train_std, y_train)
pred1 = rf.predict(X_valid_std)
rf_mse = mean_squared_error(y_valid, pred1)
rf_rmse = np.sqrt(rf_mse)
rf_rmse
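Each model section below repeats the same mean-squared-error-then-square-root pattern; as a convenience sketch (the helper name `rmse` is ours, not from the notebook), it can be factored out once:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error, the metric used for every model comparison here."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

print(rmse([3.0, 4.0], [3.0, 2.0]))  # sqrt(((0)**2 + (2)**2) / 2) = sqrt(2) ~ 1.414
```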
SVR
from sklearn.svm import SVR
svm_reg = SVR(kernel="linear")
svm_reg.fit(X_train_std, y_train)
predictions = svm_reg.predict(X_valid_std)
svm_mse = mean_squared_error(y_valid, predictions) #use the SVR predictions, not the random forest's pred1
svm_rmse = np.sqrt(svm_mse)
svm_rmse
XGBoost
import xgboost as xgb
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror')
xg_reg.fit(X_train_std, y_train)
preds = xg_reg.predict(X_valid_std)
xgbt_rmse = np.sqrt(mean_squared_error(y_valid, preds))
xgbt_rmse
Gradient Boosting Tree
from sklearn.ensemble import GradientBoostingRegressor
gbt = GradientBoostingRegressor(random_state=0)
model2 = gbt.fit(X_train_std,y_train)
y_test_pred_gbt = model2.predict(X_valid_std)
gbt_rmse = np.sqrt(mean_squared_error(y_valid, y_test_pred_gbt))
gbt_rmse
LGBM
from lightgbm import LGBMRegressor
# fit the model on the whole dataset
lgbm_reg_model = LGBMRegressor()
lgbm_reg_model.fit(X_train_std, y_train)
#Testing
lgbm_reg_pred = lgbm_reg_model.predict(X_valid_std)
lgbm_reg_mse = mean_squared_error(y_valid, lgbm_reg_pred)
lgbm_reg_rmse = np.sqrt(lgbm_reg_mse)
lgbm_reg_rmse
Graphing RMSEs
df = {'Models': ["Random Forest Regressor","Gradient Boosting Regressor", 'XG Boost','LightGBM','SVR'],
'RMSE': [rf_rmse,gbt_rmse,xgbt_rmse,lgbm_reg_rmse,svm_rmse]
}
summary = pd.DataFrame(df)
plt.figure(figsize=(8, 6))
splot=sns.barplot(x="RMSE",y="Models",data=summary)
plt.xlabel("RMSE", size=14)
plt.ylabel("Models", size=14)
THE BEST MODEL IS: LIGHTGBM
### Hyperparameter Tuning with MLFlow
!pip install mlflow
!pip install hyperopt
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
from sklearn.model_selection import cross_val_score
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from hyperopt.pyll import scope
from IPython.display import Image
import numpy as np
import lightgbm as lgb
from lightgbm import LGBMModel,LGBMRegressor
hyperparameters = {"max_depth":scope.int(hp.quniform("max_depth",2,100,5)),
"n_estimators":scope.int(hp.quniform("n_estimators",2,100,1)),
"num_leaves": scope.int(hp.quniform("num_leaves",2,50,1)),
"reg_alpha": hp.loguniform('reg_li',-5,5),
"random_state":1,
"learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.5)),
"min_child_weight": hp.uniform('min_child_weight', 0.5, 10),
"boosting": hp.choice("boosting",["gbdt","dart","goss"]),
"objective":"regression"}
def train_model(parameters):
mlflow.lightgbm.autolog()
with mlflow.start_run(nested=True):
booster = lgb.LGBMRegressor()
booster.set_params(**parameters)
booster.fit(X_train_std,y_train)
mlflow.log_params(parameters)
score = cross_val_score(booster, X_train_std, y_train, cv=5, scoring = "neg_mean_squared_error",n_jobs=-1)
mean_score = np.mean(score)
mlflow.log_metric('neg_mean_squared_error', mean_score)
return{'status':STATUS_OK,
"loss":-1*mean_score,
'booster':booster.get_params}
with mlflow.start_run(run_name='lightgbm_tuning'):
best_params = fmin(
fn=train_model,
space=hyperparameters,
algo=tpe.suggest,
max_evals = 5,
trials = Trials(),
rstate=np.random.RandomState(1))
best_params
Test Model on Test Set
lgbm_reg_model = LGBMRegressor(boosting_type='gbdt',
learning_rate=0.01786582742105907,
max_depth=50,
min_child_weight=8.640232244795891,
n_estimators=39,
num_leaves=5, n_jobs=-1)
lgbm_reg_model.fit(X_train_std, y_train)
#Testing
lgbm_reg_pred = lgbm_reg_model.predict(X_test_std)
lgbm_reg_mse = mean_squared_error(y_test, lgbm_reg_pred)
lgbm_reg_rmse = np.sqrt(lgbm_reg_mse)
lgbm_reg_rmse
This is a tool where Airbnb users can input the different predictors and get the suggested price for their listing as output.
from sklearn.model_selection import cross_val_score
import numpy as np
import lightgbm as lgb
from lightgbm import LGBMModel,LGBMRegressor
col_list=list(pd.read_csv('X_train.csv').columns)
ui_case1_model = LGBMRegressor(boosting_type='gbdt',
learning_rate=0.01786582742105907,
max_depth=50,
min_child_weight=8.640232244795891,
n_estimators=39,
num_leaves=5, n_jobs=-1)
ui_case1_model.fit(X_train[col_list], y_train)
col_list
check_list=['host_total_listings_count',
'host_identity_verified',
'accommodates',
'bedrooms',
'beds',
'number_of_reviews',
'instant_bookable',
'reviews_per_month',
'name_length',
'description_length',
'host_about_length',
'verifications_length',
'amenities_length',
'host_since_days',
'first_review_days',
'last_review_days',
'num_bath',
'breakfast_available',
'tv_available',
'dishwasher_available',
'washer and dryer_available',
'gym',
'parking']
import gradio as gr
def greet(type_users,host_total_listings_count,host_identity_verified,accommodates,bedrooms,beds,number_of_reviews,instant_bookable,reviews_per_month,name_length,description_length,host_about_length,verifications_length,number_of_amenities_provide,
host_since_days,first_review_days,last_review_days,num_bath,breakfast_available,tv_available,dishwasher_available,washer_and_dryer_available,gym,parking):
list_test=[]
host_identity_verified= 1 if host_identity_verified=='Yes' else 0
instant_bookable= 1 if instant_bookable=='Yes' else 0
breakfast_available= 1 if breakfast_available=='Yes' else 0
tv_available= 1 if tv_available=='Yes' else 0
dishwasher_available= 1 if dishwasher_available=='Yes' else 0
washer_and_dryer_available= 1 if washer_and_dryer_available=='Yes' else 0
gym= 1 if gym=='Yes' else 0
parking= 1 if parking=='Yes' else 0
name_length=len(name_length.split())
description_length=len(description_length.split())
host_about_length=len(host_about_length.split())
verifications_length=3
amenities_length=number_of_amenities_provide
check=[host_total_listings_count,host_identity_verified,accommodates,bedrooms,beds,number_of_reviews,instant_bookable,reviews_per_month,name_length,description_length,host_about_length,verifications_length,amenities_length,host_since_days,first_review_days,last_review_days,num_bath,breakfast_available,tv_available,dishwasher_available,washer_and_dryer_available,gym,parking]
for i in col_list:
if i in check_list:
list_test.append(check[check_list.index(i)])
else:
list_test.append(X_train[i].mode())
greeting = "Dear {}, Here is our Estimation for the Airbnb price :)".format(type_users)
price=ui_case1_model.predict([list_test])
print(price)
return greeting,str(round(price[0],2))+' USD'
iface = gr.Interface(
fn=greet,
inputs=[gr.inputs.Radio(['New Host','Host','Guest'], label="I am a"),
gr.inputs.Dropdown([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], label="Total Properties"),
gr.inputs.Radio(['Yes','No'], label="If Host has a Verified Identity"),
gr.inputs.Dropdown([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], label="Accommodates"),gr.inputs.Slider(0, 100),\
gr.inputs.Slider(0, 100),gr.inputs.Slider(0, 1000),gr.inputs.Radio(['Yes','No'], label="If it's instant bookable"),
gr.inputs.Slider(0, 100), 'text','text','text','text',gr.inputs.Dropdown([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]),
gr.inputs.Slider(0, 3000),gr.inputs.Slider(0, 3000),gr.inputs.Slider(0, 1000),
gr.inputs.Dropdown([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]),
gr.inputs.Radio(['Yes','No'], label="Provide Breakfast?"),
gr.inputs.Radio(['Yes','No'], label="Provide TV?"),
gr.inputs.Radio(['Yes','No'], label="Provide Dishwasher?"),
gr.inputs.Radio(['Yes','No'], label="Provide Laundry?"),
gr.inputs.Radio(['Yes','No'], label="Provide Gym?"),
gr.inputs.Radio(['Yes','No'], label="Provide Parking?")],
outputs=['text',"text"])
iface.launch()
Please refer to the Git repository to see the UI interface.

Edit the X_train to exclude data that may cause data leakage
df3.columns
df_new = df3[['host_total_listings_count',
'host_has_profile_pic', 'host_identity_verified',
'neighbourhood_group_cleansed', 'room_type', 'accommodates', 'bedrooms',
'beds', 'price', 'minimum_nights', 'maximum_nights', 'availability_90',
'availability_365', 'instant_bookable',
'name_length', 'description_length', 'host_about_length',
'verifications_length', 'amenities_length', 'price_per_accommodates',
'num_bath', 'name_bath', 'air_conditioning_available', 'tv_available',
'coffee_machine_available', 'cooking_basics', 'dishwasher_available',
'washer and dryer_available', 'gym', 'parking',
'long_term_stays_allowed', 'private_entrance', 'microwave_available']]
df_new.head()
train_new_set,valid_new_set=split_train_test(df_new,0.3)
print("The length of train_new set is: ",len(train_new_set))
print("The length of valid_new set is: ",len(valid_new_set))
valid_new_set,test_new_set=split_train_test(valid_new_set,0.4)
print("The length of valid_new set is: ",len(valid_new_set))
print("The length of test_new set is: ",len(test_new_set))
train_new_set.shape, valid_new_set.shape, test_new_set.shape
Pre-Process Training Set
#Flag Missing Values
miss_values = missing_values(train_new_set)
cols = miss_values.index
df_try = train_new_set[cols].isnull().astype(int).add_suffix('_indicator')
#merge both the df1 and the flagged columns
train_new_set = pd.merge(train_new_set, df_try, left_index=True, right_index=True)
#Iterative Imputer
df_num = train_new_set.drop(columns=['neighbourhood_group_cleansed','name_bath', 'room_type'])
imp = IterativeImputer(random_state=0)
df_num1 = imp.fit_transform(df_num)
cols = list(df_num)
df_num1=pd.DataFrame(df_num1)
df_num1.columns=cols
train_new_set[cols] = df_num1[cols].values
#categorical encoding
train_new_set = pd.get_dummies(train_new_set, columns=['neighbourhood_group_cleansed','name_bath', 'room_type'])
#Correlation
corr_matrix = train_new_set.corr()
mat = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)) #np.bool was removed in NumPy 1.24+
       .stack()
       .sort_values(ascending=False))
mat
data_corr2 = train_new_set.corr()
plt.figure(figsize=(30, 15))
heatmap1 = sns.heatmap(data_corr2, vmin=-1, vmax=1, annot=True)
heatmap1.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12)
There are no highly correlated variables
#Outlier Treatment
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(n_estimators=100, random_state=42, contamination=0.02)
pred = iforest.fit_predict(train_new_set)
score = iforest.decision_function(train_new_set)
from numpy import where
anom_index = where(pred== -1)
values2 = train_new_set.iloc[anom_index]
values2
train_new_set = train_new_set[~train_new_set.index.isin(values2.index)] #use the outliers flagged for this dataset (values2), not the earlier values
train_new_set.shape
#Separate target and predictors
y_train_new = train_new_set['price']
X_train_new = train_new_set.drop(columns=['price', 'price_per_accommodates_indicator','price_per_accommodates']) ##taking anything related to price to avoid data leakage
#standardize
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_new_std = sc.fit_transform(X_train_new)
X_train_new_std = pd.DataFrame(X_train_new_std,columns = X_train_new.columns)
#feature selection
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rfe = RFE(rf, n_features_to_select=50)
model_l = rfe.fit(X_train_new_std, y_train_new)
model_l_df2 = pd.DataFrame(list(zip(X_train_new_std.columns,model_l.ranking_)), columns = ['predictor','ranking'])
model_l_df2
notgood2 = model_l_df2[model_l_df2['ranking'] !=1 ]
notgood2
All features are important!
Pre-Process Validation Set
#Flag Missing Values
miss_values = missing_values(valid_new_set)
cols = miss_values.index
df_try = valid_new_set[cols].isnull().astype(int).add_suffix('_indicator')
#merge both the df1 and the flagged columns
valid_new_set = pd.merge(valid_new_set, df_try, left_index=True, right_index=True)
#Iterative Imputer
df_num = valid_new_set.drop(columns=['neighbourhood_group_cleansed','name_bath', 'room_type'])
df_num1 = imp.transform(df_num) #reuse the imputer fitted on train_new_set to avoid leakage (assumes the same columns as in training)
cols = list(df_num)
df_num1=pd.DataFrame(df_num1)
df_num1.columns=cols
valid_new_set[cols] = df_num1[cols].values
#categorical encoding
valid_new_set = pd.get_dummies(valid_new_set, columns=['neighbourhood_group_cleansed','name_bath', 'room_type'])
#separate target and predictors
y_valid_new = valid_new_set['price']
X_valid_new = valid_new_set.drop(columns=['price', 'price_per_accommodates_indicator','price_per_accommodates']) ##taking anything related to price to avoid data leakage
#standardize
X_valid_new_std = sc.transform(X_valid_new)
X_valid_new_std = pd.DataFrame(X_valid_new_std,columns = X_valid_new.columns)
Pre-Process Test Set
#Flag Missing Values
miss_values = missing_values(test_new_set)
cols = miss_values.index
df_try = test_new_set[cols].isnull().astype(int).add_suffix('_indicator')
#merge both the df1 and the flagged columns
test_new_set = pd.merge(test_new_set, df_try, left_index=True, right_index=True)
#Iterative Imputer
df_num = test_new_set.drop(columns=['neighbourhood_group_cleansed','name_bath', 'room_type'])
df_num1 = imp.transform(df_num) #reuse the imputer fitted on train_new_set to avoid leakage (assumes the same columns as in training)
cols = list(df_num)
df_num1=pd.DataFrame(df_num1)
df_num1.columns=cols
test_new_set[cols] = df_num1[cols].values
#categorical encoding
test_new_set = pd.get_dummies(test_new_set, columns=['neighbourhood_group_cleansed','name_bath', 'room_type'])
#separate target and predictors
y_test_new = test_new_set['price']
X_test_new = test_new_set.drop(columns=['price', 'price_per_accommodates_indicator','price_per_accommodates']) ##taking anything related to price to avoid data leakage
#standardize
X_test_new_std = sc.transform(X_test_new)
X_test_new_std = pd.DataFrame(X_test_new_std,columns = X_test_new.columns)
X_valid_new_std.shape, y_valid_new.shape, X_train_new_std.shape, y_train_new.shape
TRAIN DIFFERENT MODELS
#RandomForest
rf2 = RandomForestRegressor(n_estimators=100, random_state=0)
rf2.fit(X_train_new_std, y_train_new)
pred2 = rf2.predict(X_valid_new_std)
rf_mse2 = mean_squared_error(y_valid_new, pred2)
rf_rmse2 = np.sqrt(rf_mse2)
print("Random Forest:" , rf_rmse2)
#SVR
svm_reg2 = SVR(kernel="linear")
svm_reg2.fit(X_train_new_std, y_train_new)
predictions2 = svm_reg2.predict(X_valid_new_std)
svm_mse2 = mean_squared_error(y_valid_new, predictions2)
svm_rmse2 = np.sqrt(svm_mse2)
print("SVR:" , svm_rmse2)
#XGBoost
xg_reg2 = xgb.XGBRegressor(objective ='reg:squarederror')
xg_reg2.fit(X_train_new_std, y_train_new)
preds2 = xg_reg2.predict(X_valid_new_std)
xgbt_rmse2 = np.sqrt(mean_squared_error(y_valid_new, preds2))
print("XGBoost:" , xgbt_rmse2)
#GradientBoostingTree
gbt2 = GradientBoostingRegressor(random_state=0)
model2 = gbt2.fit(X_train_new_std,y_train_new)
y_test_pred_gbt2 = model2.predict(X_valid_new_std)
gbt_rmse2 = np.sqrt(mean_squared_error(y_valid_new, y_test_pred_gbt2))
print("GBT:" , gbt_rmse2)
#LightGBM
# fit the model on the whole dataset
lgbm_reg_model2 = LGBMRegressor()
lgbm_reg_model2.fit(X_train_new_std, y_train_new)
lgbm_reg_pred2 = lgbm_reg_model2.predict(X_valid_new_std)
lgbm_reg_mse2 = mean_squared_error(y_valid_new, lgbm_reg_pred2)
lgbm_reg_rmse2 = np.sqrt(lgbm_reg_mse2)
print("LGBM:" , lgbm_reg_rmse2)
df2 = {'Models': ["Random Forest Regressor","Gradient Boosting Regressor", 'XG Boost','LightGBM','SVR'],
'RMSE': [rf_rmse2,gbt_rmse2,xgbt_rmse2,lgbm_reg_rmse2,svm_rmse2]
}
summary2 = pd.DataFrame(df2)
plt.figure(figsize=(8, 6))
splot=sns.barplot(x="RMSE",y="Models",data=summary2)
plt.xlabel("RMSE", size=14)
plt.ylabel("Models", size=14)
XGBOOST is the best performing model in Use Case 2
HYPERPARAMETER TUNING WITH MLFLOW
import mlflow.xgboost
import xgboost as xgb
from sklearn.metrics import mean_squared_error
search_space = {"max_depth":scope.int(hp.quniform("max_depth",2,50,5)),
"n_estimators":scope.int(hp.quniform("n_estimators",50,100,1)),
#"num_leaves": scope.int(hp.quniform("num_leaves",2,50,1)),
"reg_alpha": hp.loguniform('reg_li',-5,5),
"random_state":1,
"learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.5)),
"min_child_weight": hp.uniform('min_child_weight', 0.5, 10),
#"boosting": hp.choice("boosting",["gbdt","dart","goss"]),
"objective":"reg:squarederror"
}
def train_model(parameters):
mlflow.xgboost.autolog()
with mlflow.start_run(nested=True):
booster = xgb.XGBRegressor()
booster.set_params(**parameters)
booster.fit(X_train_new,y_train_new)
mlflow.log_params(parameters)
score = cross_val_score(booster, X_train_new, y_train_new, cv=5,
scoring = "neg_mean_squared_error",n_jobs=-1)
mean_score = np.mean(score)
mlflow.log_metric('neg_mean_squared_error', mean_score)
return{'status':STATUS_OK,
"loss":-1*mean_score,
'booster':booster.get_params}
with mlflow.start_run(run_name='airbnb'):
best_params = fmin(
fn=train_model,
space=search_space,
algo=tpe.suggest,
max_evals = 10,
trials = Trials(),
rstate=np.random.RandomState(1)
)
best_params
TEST FINAL MODEL ON TEST SET
xg_reg2 = xgb.XGBRegressor(booster='gbtree', learning_rate= 0.06884784274135033,
max_depth=10,
min_child_weight=4.080472823651638,
n_estimators=53,
#reg_li=0.07394056191173794
)
xg_reg2.fit(X_train_new_std, y_train_new)
preds2 = xg_reg2.predict(X_test_new_std)
xgbt_rmse2 = np.sqrt(mean_squared_error(y_test_new, preds2))
print("XGBoost:" , xgbt_rmse2)
The “Starbucks Effect” is a term coined to describe the phenomenon in which the opening of a Starbucks store increases the value of homes and properties nearby.
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
df_airbnb=pd.read_csv(('http://data.insideairbnb.com/united-states/ny/new-york-city/2021-02-04/data/listings.csv.gz'))
df_airbnb['neighbourhood_group_cleansed'].value_counts()
df1.groupby('neighbourhood_group_cleansed')['price_per_accommodates'].mean()
df_places = gpd.read_file('new-york.geojson')
df_places['airbnb_num']=[289,4704,14474,16553,992]
df_places['starbucks_num']=[36,50,50,223,50]
df_places['price_per_acc']=[28.97,32.44,39.97,55.44,31.93]
f, ax = plt.subplots(1, figsize=(15, 12))
ax =df_places.plot(column='airbnb_num',ax=ax,legend=True)
plt.show()
f, ax = plt.subplots(1, figsize=(15, 12))
ax =df_places.plot(column='starbucks_num',ax=ax,legend=True)
plt.show()
f, ax = plt.subplots(1, figsize=(15, 12))
ax =df_places.plot(column='price_per_acc',ax=ax,legend=True)
plt.show()
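Before reaching for causal analysis, a quick borough-level Pearson correlation on the five hard-coded counts above gives a first impression (with only five points, this is suggestive at best):

```python
import numpy as np

# the per-borough values assigned to df_places above
starbucks_num = np.array([36, 50, 50, 223, 50])
price_per_acc = np.array([28.97, 32.44, 39.97, 55.44, 31.93])
r = np.corrcoef(starbucks_num, price_per_acc)[0, 1]
print(round(r, 2))  # strongly positive, driven largely by Manhattan
```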
It seems like the Starbucks Effect may be at play here. However, we conducted causal analysis using DoWhy to assess whether the effect is significant (see the CausalML folder in the Git repository).
PLEASE NOTE: this was conducted more as an exercise than to derive actionable insights.
import numpy as np
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.semi_supervised import LabelSpreading
import numpy as np
import random
from sklearn.metrics import mean_squared_error
from numpy import concatenate
from sklearn.model_selection import train_test_split
def runLP(x,target,x_test,target_test,n):
data = x
labels = target
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(data, labels, test_size=n, random_state=123)
#RUN THE MODEL
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
print(y_train_mixed)
model = LabelSpreading(max_iter=100)
model.fit(X_train_mixed, y_train_mixed)
pred = np.array(model.predict(x_test))
#SEPARATE PREDICTED SAMPLES
print(model.predict(x).sum())
#PRINT CONFUSION MATRIX
return model, mean_squared_error(target_test, pred),target_test,pred
#train_set.columns
#train_set.describe().columns
target=train_set['price']
x=train_set[train_set.describe().columns].drop(['price'],axis=1)
x=x.to_numpy()
target_test=test_set['price']
x_test=test_set[test_set.describe().columns].drop(['price'],axis=1)
x_test=x_test.to_numpy()
target_test.to_numpy()
#runLP is defined above but never invoked; call it so that `pred` refers to the semi-supervised predictions
model_lp, lp_mse, target_test, pred = runLP(x, target, x_test, target_test, 0.5)
pd.DataFrame(pred).describe()
train_set.columns